Is ignorance bliss? Knowledge of privacy issues vs. one’s level of concern

Can following the news on data privacy issues affect an individual’s concern about their own privacy? This report seeks to find what is behind someone’s concern about data privacy.

Luísa Shida
Photo by Towfiqu Barbhuiya

Figure 1: Photo by Towfiqu Barbhuiya

Introduction

Does an individual’s level of education and knowledge on how their data is being used and shared digitally influence their attitude towards data sharing? In this study, I intend on analyzing whether individuals who are more familiar with the framework of data privacy tend to be more or less granular in their relationship with data sharing, that is, whether they are more or less cautious when choosing to share their information with companies, the government, and other institutions. I expect that individuals who have more knowledge on issues of data privacy will be more protective of their data and more careful in how they engage with technology, while individuals with less familiarity with the notions of data privacy will tend to be less attentive and more permissive with their data-sharing choices. I believe this hypothesis might be true since individuals with more knowledge on data privacy will most likely be more aware of the potential dangers that surround sharing personal information with companies and services, therefore looking to prevent these dangers from happening to them. On the other hand, individuals who are not aware of the perils of data sharing might not be afraid of engaging with technology more freely, given that they might not fully comprehend the risks involved with sharing their data.

Data

Dataset overview

The data used in this study is taken from the June 2019 survey “American Trends Panel Wave 49”, carried out by the company Ipsos and sponsored by the Pew Research Center for the People & the Press. Respondents were US adults randomly selected from phone number random-digit-dial surveys, answering a series of multiple-choice questions on issues related to technology, social media, surveillance and data privacy. The survey had a total of 4272 participants; this is an overview of the dataset:

library(tidyverse)
library(dplyr)
dataprivacy <- read_csv("dataprivacy.csv")
rmarkdown::paged_table(dataprivacy)

Study Overview

This observational study is cross-sectional, analyzing the relationships between different variables in accordance with the survey’s data. We are particularly interested in studying the potential relationships between our dependent variable, whether the respondent is concerned with their data privacy or not, and their knowledge of data privacy issues. In regards to quantifying one’s “knowledge of data privacy issues”, this study will consider whether or not respondents follow privacy issues in the news (privacynews). In order to address confounders and study other potential relationships, we will also include analyses with the independent variables of age group, educational attainment, and whether or not the respondent has had one of their social media accounts hacked in the last 12 months.

For the purposes of this study, the ordinal variables of interest were renamed, and to facilitate the calculation of regressions, I created corresponding numeric variables to some of the ordinal variables, attributing numerical values for their possible responses, as follows:

Original Variable Name Renamed Variable Corresponding Numeric Variable Description Possible responses
CONCERNCO_W49 concern_level concern_level_num How concerned respondent is with how companies use their data 4 = Very concerned, 3 = somewhat concerned, 2 = not too concerned, 0 = not at all concerned, Refused (removed from dataset)
F_AGECAT age_group - Respondent’s age group 18-29yrs, 30-49yrs, 50-64yrs, 65+yrs
F_EDUCCAT education - Respondent’s educational attainment 3 = College graduate+ (college degree and above), 2 = some college (unfinished college degree), 1 = high school graduate or less, 0 = don’t know/refused
PRIVACYNEWS1_W49 privacynews privacynews_num How closely respondent follows news about privacy issues 3 = Very closely, 2 = somewhat closely, 1 = not too closely, 0 = not at all closely
DB1b_W49 hacking - Whether or not respondent has had a social media or email account taken over without permission 1 = Yes, 0 = no, 0 = offline (doesn’t have internet access)

Given that analyzing the relationship between one’s level of concern about data privacy and other variables is the point of this study, I have also filtered out survey respondents who did not answer the question about their concern level or how closely they followed privacy news. This results in a total number of 2132 respondents in the tidied data frame, out of the original 4272.

Tidying up the data

dataprivacy_tidy <- dataprivacy |>
  dplyr::select(F_AGECAT, F_EDUCCAT, PRIVACYNEWS1_W49, CONCERNCO_W49, DB1b_W49) |>
    rename(age_group = F_AGECAT, 
         education = F_EDUCCAT, 
         privacynews = PRIVACYNEWS1_W49, 
         concern_level = CONCERNCO_W49,
         hacking = DB1b_W49) |> 
  drop_na(concern_level) |>
  filter(!concern_level == "Refused", !privacynews == "Refused") |>
  mutate(privacynews_num= case_when(privacynews == "Not at all closely" ~ 0,
                                    privacynews == "Not too closely" ~ 1,
                                    privacynews == "Somewhat closely" ~ 2,
                                    privacynews == "Very closely" ~ 3), 
         concern_level_num = case_when(concern_level == "Not at all concerned" ~ 0, 
                                       concern_level == "Not too concerned" ~ 1,
                                       concern_level == "Somewhat concerned" ~ 2,
                                       concern_level == "Very concerned" ~ 3),
         hacking_num = if_else(hacking == "Yes", 1, 0))

rmarkdown::paged_table(dataprivacy_tidy)

Visualizing the dependent variable

The dependent variable for this study is one’s level of concern regarding their data privacy (concern_level). This variable corresponds to the survey question:

How concerned are you, if at all, about how companies are using the data they collect about you?

The following plot and table summarize the number of respondents per level of concern with data privacy in accordance to their responses:

plot_concernlevel <- dataprivacy_tidy |>
  ggplot(mapping = aes(
    x = concern_level,
    fill = concern_level)) + 
  geom_bar() +
  labs(title = "Number of Respondents per Level of Concern about Data Privacy", 
       x = "Level of Concern", 
       y = "Number of Respondents") + 
  theme(legend.position = "none") +
  scale_fill_manual(values = wes_palette("Zissou1", 4, type = "continuous"))

plot_concernlevel
count_concern_level <- dataprivacy_tidy |> group_by(concern_level) |> count(concern_level)
knitr::kable(count_concern_level, col.names = c("Level of Concern", "Number of Respondents"))
Level of Concern Number of Respondents
Not at all concerned 59
Not too concerned 368
Somewhat concerned 919
Very concerned 786

As we can see from the plot and table above, most of the participants were on the higher end of the concern scale - “somewhat concerned” (919 respondents) or “very concerned” (786).

Results

Visualizing relationships of interest

Before we run the regressions between the variables, we will plot the relationships between the dependent variable (level of concern) and each independent variable of interest in order to visualize potential associations, in spite of not yet being able to draw any causal relationships between them since we cannot address confounders or potential biases.

1. Level of concern x Knowledge of privacy issues (how closely they follow data privacy news)
p_concernxprivacynews <- dataprivacy_tidy |> 
  ggplot(mapping = aes(
    x = concern_level,
    fill = privacynews)) + 
  geom_bar(position = "dodge") +
  scale_fill_manual(values = wes_palette("GrandBudapest2", 4, type = "continuous")) +
  labs(title = "Level of Concern x Data Privacy News Following",
       x = "Level of Concern",
       y = "Number of Respondents",
       fill = "How closely respondent
follows privacy news") + 
   theme(axis.text.x = element_text(size = 7))
p_concernxprivacynews

The plot above shows us that most respondents follow privacy news “somewhat closely”, especially when their level of concern is higher (“somewhat concern” or “very concerned”); the highest number of individuals who follow news “very closely” is within the “very concerned” group. The predominant trend in news following for the groups with lesser concern is “not too closely.” Therefore, it seems that the respondents with higher concern levels tend to follow news more closely.

2. Level of Concern x Age Group
p_concernxage <- dataprivacy_tidy |> 
  group_by(age_group) |>
  ggplot(mapping = aes(
    x = age_group,
    fill = concern_level)) + 
  geom_bar(position = "dodge") +
  scale_fill_manual(values = wes_palette("GrandBudapest1", 4, type = "continuous")) +
  labs(title =  "Age Group x Level of Concern",
       x = "Age Group",
       y = "Number of Respondents",
       fill = "Level of Concern")

age_concern_prop <- dataprivacy_tidy |> 
  group_by(age_group) |> 
  summarize(mean_concern = mean(concern_level_num))

p_concernxage
knitr::kable(age_concern_prop, col.names = c("Age Group", "Mean Concern Level"))
Age Group Mean Concern Level
18-29 1.944281
30-49 2.153495
50-64 2.202454
65+ 2.178794

From the plot above, we can see that all age groups had a majority of its respondents fall within the “somewhat concerned” category, followed subsequently by “very concerned”, “not too concerned”, and “not at all concerned.” The table above indicates that the 50-64 age group has the highest average level of concern, followed by the 65+ group, then the 30-49 group, and finally the 18-29 group, with the lowest mean concern level of 1.9 (the younger group, although having a majority of respondents stating that they are “somewhat concerned”, falls under an average of the “Not too concerned” category).

3. Level of Concern x Educational Attainment
p_concernxeducation <- dataprivacy_tidy |> 
  group_by(education) |>
  filter(!education == "Don't know/Refused") |>
  mutate(education = case_when(education == "H.S. graduate or less" ~ "1. High school graduate or less",
                               education == "Some College" ~ "2. Partial college education",
                               education == "College graduate+" ~ "3. Completed college degree and above")) |>
  ggplot(mapping = aes(
    x = education,
    fill = concern_level)) + 
  geom_bar(position = "dodge") +
  scale_fill_manual(values = wes_palette("Rushmore1", 4, type = "continuous")) +
  labs(title =  "Educational Attainment x Level of Concern",
       x = "Educational Attainment",
       y = "Number of Respondents",
       fill = "Level of Concern") +
  theme(axis.text.x = element_text(size = 7))

education_concern_prop <- dataprivacy_tidy |> 
  group_by(education) |> 
  filter(!education == "Don't know/Refused") |>
  summarize(mean_concern = mean(concern_level_num))

p_concernxeducation
knitr::kable(education_concern_prop, col.names = c("Educational Attainment", "Mean Concern Level"))
Educational Attainment Mean Concern Level
College graduate+ 2.147500
H.S. graduate or less 2.072603
Some College 2.211055

From the plot and table above, we can see that individuals who did not complete college but partially attended had the highest average concern level, followed by college graduates and above, and then high school graduates or less. In all three groups, the majority’s sentiment was “somewhat concerned”, followed by “very concerned”, “not too concerned”, and then “not at all concerned”.

4. Level of Concern x Victim of Hacking For this specific variable, we’re going to treat this study almost as if it were a difference-in-means and focus on whether there is a relationship between individuals who have had suffered from others taking over their social media without permission (which we will consider the “treatment” group) in the last 12 months and those who haven’t (“control” group). Ultimately, this will be interesting to assess whether people who have suffered from cybersecurity breaches are more concerned with their safety and privacy online. A sidenote is that this is not a completely accurate difference-in-means, given that we have very little information on what kind of hacking incident the respondents might have suffered - however, it can still make for an interesting analysis in our study.

cyber_breach <- dataprivacy_tidy |> 
  select(hacking_num, concern_level_num) |> 
  group_by(hacking_num) |>
  mutate(treatment = if_else(hacking_num == "1", "Treated", "Control"))

cyber_breach_ate <- cyber_breach |> 
  group_by(treatment) |> 
  drop_na() |> 
  summarize(mean_concern = mean(concern_level_num)) |>
  pivot_wider(names_from = treatment, values_from = mean_concern) |>
  mutate(ATE = Treated - Control)

knitr::kable(cyber_breach_ate, digits = 2, col.names = c("Control Group Mean Level of Concern", "Treated Group Mean Level of Concern", "ATE"))
Control Group Mean Level of Concern Treated Group Mean Level of Concern ATE
2.12 2.33 0.2

As we can see in the plot and table above, the mean privacy concern level of the treated group (that is, individuals who have suffered social media take overs without consent) is higher than that of the control group (those who have not suffered social media hacking). In order to better analyze this discrepancy, we can take a look at the average treatment effect (ATE) between these means - which is the positive value of 0.20. Although we cannot interpret this causally since many confounders have not been yet addressed, the ATE does confirm that the treated group of this study has higher data privacy and cybersecurity concerns than the control group.

Regressions

1. Main Variable Regression

The main analysis of interest is the potential relationship between the level of concern over data privacy (concern_level_num) and knowledge on privacy issues, quantified by how closely one follows news about data privacy (privacynews). Therefore, the regression below is ran only using these two variables.

main_fit <- lm(concern_level_num ~ privacynews, data = dataprivacy_tidy)

modelsummary::modelsummary(list(Regression = main_fit), 
                           statistic = c("s.e. = {std.error}", "p = {p.value}"), 
                           gof_map = c("nobs", "r.squared", "adj.r.squared"), 
                           fmt = 2, 
                           coef_rename = c(privacynews_num = "Privacy News Following"))
Regression
(Intercept) 1.78
s.e. = 0.06
p = <0.01
privacynewsNot too closely 0.11
s.e. = 0.07
p = 0.11
privacynewsSomewhat closely 0.48
s.e. = 0.06
p = <0.01
privacynewsVery closely 0.82
s.e. = 0.08
p = <0.01
Num.Obs. 2132
R2 0.097
R2 Adj. 0.096

Given that privacynews is a categorical variable, the intercept in the regression above tells us that the average baseline concern level for respondents who do not follow privacy news at all (the ommitted group) is 1.78. The other coefficients tell us the difference in concern level between each other group and the group of individuals who do not follow the news. In other words, the regression above tells us that:

Working under an alpha of 0.1, not all of these numbers would therefore all be statistically significant - the p-value for the “not too closely” group is 0.11. Still, the regression results are consistent with my initial hypothesis, since the group with the highest average concern level is that which follows privacy news very closely. Nonetheless, given that many confounders have not been addressed, we cannot interpret this relationship as necessarily causal.

2. Adressing Confounders - Multivariable Regression

In order to reach a more accurate representation of the relationship between level of concern about data privacy and knowledge on privacy issues based on news following, we can try to control for some confounders. Thus, we will run a multivariate regression model including the independent variables age group, educational attainment level and whether or not individuals had one of their social media or email accounts hacked in the past 12 months.

The main reasoning behind considering age group as a confounder is that we must take the age digital divide into account in our study. As a study by the European Money and Finance Forum states, older generations are less likely to use technology and digital services than younger generations. Since there is a discrepancy in how different age groups use of technology and thus what their relationship with data is like, we should investigate whether and how that can affect how they perceive data privacy. Educational attainment is also an interesting variable to study as a possible confounder - would a higher educational degree in any way be associated with more concern over data privacy? A A 2016 survey by the EDUCAUSE Center for Analysis and Research (ECAR) revealed that one-third of undergraduate students were concerned that technology might lead to privacy invasions, for instance - will this finding hold in our study? Finally, the last potential confounder we will include in the regression is whether the respondent has suffered a hacking attack in the past 12 months. This is interesting since a possible hypothesis is that individuals who have had negative experiences with data privacy breaches would be more concerned about their data privacy after the event, as well as more knowledgeable on how breaches happen.

multivariate_dataprivacy <- dataprivacy_tidy |> 
  filter(!education == "Don't know/Refused",
         !hacking == "Refused",
         !hacking == "Offline: does not have internet")

multi_fit <- lm(concern_level_num ~ privacynews + age_group + education + hacking, 
                data = multivariate_dataprivacy)

modelsummary::modelsummary(list
                           (Regression = multi_fit),
                           statistic = c("s.e. = {std.error}", "p = {p.value}"), 
                           gof_map = c("nobs", "r.squared", "adj.r.squared"), 
                           fmt = 2)
Regression
(Intercept) 1.65
s.e. = 0.07
p = <0.01
privacynewsNot too closely 0.09
s.e. = 0.07
p = 0.18
privacynewsSomewhat closely 0.45
s.e. = 0.06
p = <0.01
privacynewsVery closely 0.79
s.e. = 0.08
p = <0.01
age_group30-49 0.17
s.e. = 0.05
p = <0.01
age_group50-64 0.18
s.e. = 0.05
p = <0.01
age_group65+ 0.15
s.e. = 0.05
p = <0.01
educationH.S. graduate or less −0.06
s.e. = 0.04
p = 0.15
educationSome College 0.07
s.e. = 0.04
p = 0.07
hackingYes 0.15
s.e. = 0.06
p = <0.01
Num.Obs. 2063
R2 0.112
R2 Adj. 0.108

From the regression above, we can gather plenty of information about how these variables relate to one’s level of concern. However, when adopting an alpha of 0.1, the most interesting observations are:

It is interesting to compare the results from this regression to the results from the previous regression, focusing on comparing the outcomes for the privacynews variable. The values for each of the groups (“not too closely”, “somewhat closely”, “very closely”) are very close, but not the same. Therefore, we can see that trying to address some confounders did change the results of our regression.

privacynews group Simple Regression Coefficient Multivariate Regression Coefficient
Not too closely 0.11 (but not statistically significant) 0.09 (but not statistically significant)
Somewhat closely 0.48 0.45
Very closely 0.82 0.79

Conclusion

References

Roper iPoll by the Roper Center for Public Opinion Research. https://ropercenter.cornell.edu/ipoll/study/31116657/questions?text=college#c5b33662-fd89-426c-8923-ff1d530f920f. Acesso em 13 December 2023.

Cover photo by Towfiqu Barbhuiya. Towfiqu barbhuiya na Unsplash

“Population ageing and the digital divide, SUERF Policy Brief .:. SUERF - The European Money and Finance Forum”. SUERF.ORG, https://www.suerf.org/suerf-policy-brief/40251/population-ageing-and-the-digital-divide. 13 December de 2023.

D. Christopher Brooks, ECAR Study of Undergraduate Students and Information Technology, EDUCAUSE (2016), p. 22